The Church and Hanks reading shows how interesting semantics can be extracted by looking at very simple patterns. For instance, if we look at what gets drunk (the objects of the verb drink), we can automatically acquire a list of beverages. Similarly, in a text about mythology, looking at the subjects of certain informative verbs might let us group all the gods' names together by seeing who does the blessing and smiting. More generally, the common objects of verbs (or, in some cases, their subjects) give us another piece of evidence for grouping similar words together.
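To make the idea concrete before diving into the assignment, here is a minimal sketch (not part of the assignment) that approximates the objects of drink in the tagged Brown corpus by taking the first noun that follows the verb. The verb-form list and the "first following noun" heuristic are my own crude assumptions, and the code assumes the same Python 2 / older NLTK environment used in the rest of this notebook.
import nltk
from nltk.corpus import brown
# Crude object detection: for each occurrence of a "drink" form tagged as a verb,
# take the first noun that follows it in the same sentence.
drink_forms = set(["drink", "drinks", "drank", "drunk", "drinking"])
drink_objects = []
for sent in brown.tagged_sents():
    for i, (word, tag) in enumerate(sent):
        if word.lower() in drink_forms and tag.startswith("VB"):
            for obj, obj_tag in sent[i + 1:]:
                if obj_tag.startswith("NN"):
                    drink_objects.append(obj.lower())
                    break
print nltk.FreqDist(drink_objects).items()[:15]
If the heuristic works at all, the most frequent "objects" should be dominated by beverage words, which is exactly the Church and Hanks observation.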
Find frequent verbs: Using your tagged collection from the previous assignment, first pull out verbs and then rank by frequency (if you like, you might use WordNet's morphy() to normalize them into their lemma form, but this is not required). Print out the top 40 most frequent verbs and take a look at them:
In [1]:
import nltk
import re
from nltk.corpus import brown
In [2]:
import debates_util
In [3]:
debates = nltk.clean_html(debates_util.load_pres_debates().raw())
In [4]:
"""
Returns the corpus for the presidential debates with words tokenized by regex below.
"""
token_regex= """(?x)
# taken from the NLTK book example
([A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():_`-] # these are separate tokens; '-' is last so it is literal
"""
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
In [5]:
tokens = nltk.regexp_tokenize(debates, token_regex)
In [6]:
def build_backoff_tagger(train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2
tagger = build_backoff_tagger(brown.tagged_sents())
In [7]:
tags = tagger.tag(tokens)
In [8]:
sents = list(sent_tokenizer.sentences_from_tokens(tokens))
In [9]:
v_fd = nltk.FreqDist([t[0] for t in tags if re.match(r"V.*", t[1])])
v_fd.items()[50:100]
Out[9]:
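The prompt above mentions that WordNet's morphy() can optionally be used to normalize the verbs to their lemma form before counting. Here is a hedged sketch of how that might look; it reuses tags and re from the cells above, the fallback to the surface form when morphy() returns None is my own choice, and verb_lemma and v_fd_lemmas are hypothetical names.
from nltk.corpus import wordnet as wn
def verb_lemma(word):
    # morphy() returns None when it cannot find a lemma; fall back to the surface form
    lemma = wn.morphy(word.lower(), wn.VERB)
    return lemma if lemma is not None else word.lower()
# Same verb filter as above, but counting lemmas instead of surface forms.
v_fd_lemmas = nltk.FreqDist([verb_lemma(t[0]) for t in tags if re.match(r"V.*", t[1])])
v_fd_lemmas.items()[:40]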
Pick out 2 interesting verbs: Next, manually pick out two verbs that look interesting to you and examine them in detail. Try to pick verbs whose objects will be interesting and will form a pattern of some kind. Find all the sentences in your corpus that contain these verbs.
In [10]:
defend_sents = [s for s in sents if "defend" in s]
[" ".join(s) for s in defend_sents[0:20]]
Out[10]:
In [11]:
help_sents = [s for s in sents if "help" in s]
[" ".join(s) for s in help_sents[0:20]]
Out[11]:
Find common objects: Now write a chunker to find the simple noun phrase objects of these verbs and see if they tell you anything interesting about your collection. Don't worry about making the noun phrases perfect; you can use the chunker from the first part of this homework if you like. Print out the common noun phrases and take a look. Write the code below, show some of the output, and then reflect on that output in a few sentences.
In [12]:
np_chunker = r"""
VPHRASE: {<V.*><DT|AT|P.*|JJ.*|IN>*<NN.*>+}
"""
np_parser = nltk.RegexpParser(np_chunker)
In [13]:
t_defend = [tagger.tag(s) for s in defend_sents]
t_help = [tagger.tag(s) for s in help_sents]
In [14]:
c_defend = [np_parser.parse(s) for s in t_defend]
c_help = [np_parser.parse(s) for s in t_help]
In [15]:
fd_defend = nltk.FreqDist([" ".join(w[0] for w in sub[1:]) for t in c_defend for sub in t.subtrees() if sub.node=="VPHRASE" and sub[0][0].lower()=="defend"])
fd_defend.items()[0:10]
Out[15]:
In [16]:
fd_help = nltk.FreqDist([" ".join(w[0] for w in sub[1:]) for t in c_help for sub in t.subtrees() if sub.node=="VPHRASE" and sub[0][0].lower()=="help"])
fd_help.items()[0:10]
Out[16]:
These are interesting results, confirming many of my vague ideas about the things politicians have discussed in the past. A further modification would be to try to group noun phrases that refer to the same thing, for example "this country" and "this nation" (a rough sketch of this follows below).
I would also like to split this up by debate to see how the nouns change from election to election.
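A rough sketch of the grouping idea, as promised above: collapse noun phrases that refer to the same thing before counting. The alias map here is hand-made and purely illustrative (np_aliases, normalize_np, and grouped are hypothetical names); a more serious version might map phrase heads to WordNet synsets instead.
# Hand-made, illustrative alias map: phrases on the left are folded into the phrase on the right.
np_aliases = {
    "this nation": "this country",
    "the nation": "this country",
    "the country": "this country",
}
def normalize_np(np):
    return np_aliases.get(np.lower(), np.lower())
# Re-count the objects of "defend" with equivalent phrases merged.
grouped = {}
for np, count in fd_defend.items():
    key = normalize_np(np)
    grouped[key] = grouped.get(key, 0) + count
sorted(grouped.items(), key=lambda kv: kv[1], reverse=True)[:10]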
In [17]:
from nltk.corpus import wordnet as wn
from nltk.corpus import brown
from nltk.corpus import stopwords
This code first pulls out the most frequent words from a section of the Brown corpus after removing stop words. It lowercases everything, but it should really do much smarter things with tokenization, phrases, and so on.
In [18]:
def preprocess_terms():
    # select a subcorpus of brown to experiment with
    words = [word.lower() for word in brown.words(categories="science_fiction") if word.lower() not in stopwords.words('english')]
    # count up the words
    fd = nltk.FreqDist(words)
    # show some sample words
    print ' '.join(fd.keys()[100:150])
    return fd
fd = preprocess_terms()
Then it makes a very naive guess at which words are most important. This is where some real term weighting should take place (a rough sketch of one option follows the cell below).
In [19]:
def find_important_terms(fd):
    important_words = fd.keys()[100:500]
    return important_words
important_terms = find_important_terms(fd)
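As noted above, this is where real term weighting belongs. A hedged sketch of one option: score each word by its frequency in the science_fiction subcorpus relative to its frequency in Brown as a whole, a crude tf-idf-like ratio. It reuses fd from preprocess_terms() above; all_brown_fd, distinctiveness, important_terms_weighted, the add-one smoothing, and the cutoff of 400 terms are my own arbitrary choices.
# Background counts over all of Brown, lowercased to match fd.
all_brown_fd = nltk.FreqDist(w.lower() for w in brown.words())
def distinctiveness(word):
    # how much more often the word appears in the subcorpus than in Brown overall;
    # the +1 just keeps the denominator positive
    return fd[word] / float(all_brown_fd[word] + 1)
important_terms_weighted = sorted(fd.keys(), key=distinctiveness, reverse=True)[:400]
print ' '.join(important_terms_weighted[:50])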
The code below is a very crude way to see what the most common "topics" are among the "important" words, according to WordNet. It does this by looking at the immediate hypernyms of every sense of a wordform, for those wordforms that WordNet knows as nouns. This is problematic because many of these senses will be incorrect, and the hypernym often elides the specific meaning of the word; but if you compare, say, romance to science_fiction in Brown, you do see differences in the results.
In [20]:
# Count the direct hypernyms for every sense of each wordform.
# This is very crude. It should convert the wordform to a lemma, and should
# be smarter about selecting important words and finding two-word phrases, etc.
# Nonetheless, you get interesting differences between, say, scifi and romance.
def categories_from_hypernyms(termlist):
    hypterms = []
    for term in termlist:                   # for each term
        s = wn.synsets(term.lower(), 'n')   # get its nominal synsets
        for syn in s:                       # for each synset
            for hyp in syn.hypernyms():     # it has a list of hypernyms
                hypterms = hypterms + [hyp.name]  # extract the hypernym name and add it to the list
    hypfd = nltk.FreqDist(hypterms)
    print "Show most frequent hypernym results"
    return [(count, name, wn.synset(name).definition) for (name, count) in hypfd.items()[:25]]
categories_from_hypernyms(important_terms)
Out[20]:
Here is the question: Modify this code in some way to do a better job of using WordNet to summarize terms. You can trim senses in a better way, or traverse hypernyms differently. You don't have to use hypernyms; you can use any WordNet relations you like, or choose your terms in another way. You can also use other parts of speech if you like.
In [39]:
def get_hypernyms(synsets, max_distance=100):
    """
    Takes a list of synsets (as generated by wn.synsets) and returns the set of all
    their hypernyms that appear at position <= max_distance (counted from the root)
    on any hypernym path; the synsets themselves are excluded.
    """
    hypernyms = set()
    for synset in synsets:
        for path in synset.hypernym_paths():
            hypernyms.update([h for idx, h in enumerate(path) if h != synset and idx <= max_distance])
    return hypernyms

def fd_hypernyms(fd, depth=None, min_depth=0, max_distance=100, pos=None):
    """
    Takes a frequency distribution and analyzes the hypernyms of the wordforms contained therein.
    Returns a list of (hypernym synset, accumulated relative frequency) pairs, sorted by weight.
    fd - frequency distribution
    depth - how many of the most frequent wordforms in fd to consider
    min_depth - only include hypernyms whose depth in WordNet is at least this value.
                Unintuitively, max_depth() is used to calculate the depth of a synset.
    max_distance - only keep hypernyms that appear within this many steps of the root
                   on a hypernym path; together with min_depth this selects hypernyms
                   at roughly that level of generality.
    pos - part of speech to limit synsets to
    """
    hypernyms = {}
    for wf in fd.keys()[0:depth]:
        freq = fd.freq(wf)
        hset = get_hypernyms(wn.synsets(wf, pos=pos), max_distance=max_distance)
        for h in hset:
            if h.max_depth() >= min_depth:
                if h in hypernyms:
                    hypernyms[h] += freq
                else:
                    hypernyms[h] = freq
    hlist = hypernyms.items()
    hlist.sort(key=lambda s: s[1], reverse=True)
    return hlist

def concept_printer(concepts, n=20):
    "Prints the first n concepts in a concept list generated by fd_hypernyms."
    print "{:<20} | {:<12} | {}".format("Concept", "Concept Freq", "Definition")
    print "===================================================================="
    for s in concepts[0:n]:
        print "{:<20} | {:<12.3%} | {}".format(s[0].lemma_names[0], s[1], s[0].definition)
In [78]:
concepts = fd_hypernyms(fd, depth=500, max_distance=4, min_depth=4)
concept_printer(concepts)